for loop and while loopA Python script is a file containing Python code that can be executed all at once.
.pygenotype = "AG"
phenotype = "expressed"
if genotype == "AG":
# Only check phenotype if genotype is "AG"
if phenotype == "expressed":
print("Variant is active and expressed.")
else:
print("Variant is inactive.")
else:
print("Non-target variant.")
.py¶.py entension - people and the OS can recognize it. Put some of your Python code you have written yesterday and save it in a Python script, run it in a terminal with either
python yourscript.py
or
./yourscript.py (not required for the Windows user)
yourscript.py with the actual name you use#!/usr/bin/env python at the beginning of the script and also make the script executable in order to run it with the second method (only for Mac or Linux users)file = open("filename.txt", "r")
content = file.read()
file.close()
'r' specifies the mode for opening the file as a read-only text file. If the file does not exist, a FileNotFoundError will be raised.
Other types of mode:
'rb': Opens the file as a binary file for reading.'r+ : Opens the file for reading and writing.Extra reading: https://www.w3schools.com/python/python_file_handling.asp
with keyword to ensure the file handle be closed automatically¶file = open("filename.txt", "r")
content = file.read()
file.close()
with open("filename.txt", "r") as file:
content = file.read()
encoding='utf-8' when opening a file in Python to ensure proper handling of text files that contain non-ASCII characters, such as Chinese characters.¶with open ("filename.txt", "r", encoding='utf-8') as file
content = file.read()
I have a file contains DNA sequences and want to use Python to read its content and then calculate the number of nucleotides, that is, get the length of the dna sequence.
seqfile = "../files/one_dna_sequence.fa"
with open(seqfile, "r") as file:
for line in file:
print(line.strip())
string¶'string'.strip() Removes whitespace
'string'.split() Splits on whitespace into list
string1 = ' an example string with whitespaces at both ends '
string2 = string1.strip()
string2
list1 = string2.split()
list1
list2 = string1.strip().split()
list2
file = open("output.txt", "w")
file.write("Hello, Python!")
file.close()
'w': Opens a file for writing only. If the file does not exist, it creates a new file. If the file exists, its previous content will be truncated, effectively deleting the content before the new write operations.Other modes:
'a': Opens the file for writing, appending any new data you write to the end of the file's existing content, thus preserving the previous content.Extra reading: https://www.w3schools.com/python/python_file_handling.asp
with keyword¶with open("../newproject/output.txt", "w") as file:
file.write("Hello, Python!")
seqfile = "../files/one_dna_sequence.fa"
with open(seqfile, "r") as file:
seqlength = 0
for line in file:
line = line.strip()
if not line.startswith(">"):
seqlength += len(line)
outfile = "../newproject/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
file.write("Length of DNA sequence: " + str(seqlength))
Link: https://python-bioinfo.bioshu.se/exercises.html
You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000, including the start and end positions). What you want to know is whether this sample has any variants in this region, and if so, which variants.
0/10 means DNA at this position is the same as reference allele (A), 1 means has alternative allele, in this case AC The notation of alternative allele is not always 1
0/22 means it is the second alternative allele, i.e. C in this caseYou have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000, including the start and end positions). What you want to know is whether this sample has any variants in this region, and if so, which variants.
#)5 and position is between 1,200,000 and 1,300,0000/1:30:SM0/1) from the column0/0.#), and print the first record¶vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
print(line)
break
# Next, find chromosome 5
5¶vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
if chrom == '5':
print(cols)
break
# Next, find the correct region
5 and position is between 1,200,000 and 1,300,000¶vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and (1200000 <= pos <= 1300000):
print(cols)
break
# Next, find the genotypes for sample1
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and (1200000 <= pos <= 1300000):
geno_col = cols[9]
print(geno_col)
break
# Next, extract the genotypes only
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and (1200000 <= pos <= 1300000):
geno_col = cols[9]
geno = geno_col.split(':')[0]
print(geno)
break
# Next, find in which positions sample1 has alternate alleles
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and (1200000 <= pos <= 1300000):
geno_col = cols[9]
geno = geno_col.split(':')[0]
if geno not in ["0/0"]:
print(geno)
#Next, print nicely
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
for line in fh:
if not line.startswith('#'):
line = line.strip()
cols = line.split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and (1200000 <= pos <= 1300000):
geno_col = cols[9]
geno = geno_col.split(':')[0]
if geno not in ["0/0"]:
ref = cols[3]
alt = cols[4]
info = chrom + ":" + str(pos) + "_" + ref + "-" + alt + " has genotype: "+ geno
print(info)
source: mayoclinic.org
source: cff.org
plaintext
>MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
seqname: The name of the sequence (typically a chromosome).source: The source of the annotation (e.g., ENSEMBL).feature: The type of feature (e.g., gene, transcript, exon).start: The starting position of the feature in the sequence.end: The ending position of the feature in the sequence.score: A score between 0 and 1000, or . if not applicable (indicating the reliability of the annotation).strand: The strand on which the feature is located (+ for the forward strand, - for the reverse strand).frame: The reading frame, one of '0', '1' or '2', or . if not applicable.attribute: A list of key-value pairs providing additional information about the feature.Some attributes (always semi-colon ';' separated key-value pairs):
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";